ABSTRACT

In this work we present a new structure for multiplication
in finite fields. This structure is based on a digit-level LFSR
(Linear Feedback Shift Register) multiplier in which the area
of the digit-multipliers is reduced using the Karatsuba method.
We compare our results with other works from the literature
for $\mathbb{F}_{3^{97}}$. We also propose new formulas for
multiplication in $\mathbb{F}_{3^{6 \cdot 97}}$. These new formulas
reduce the number of $\mathbb{F}_{3^{97}}$-multiplications from 18
to 15. The finite fields $\mathbb{F}_{3^{97}}$ and
$\mathbb{F}_{3^{6 \cdot 97}}$ are important fields for pairing-based
cryptography.

Keywords: finite field multiplication, FPGA, pairing-based
cryptography.

1. INTRODUCTION 

Efficient multiplication in finite fields is a central task in the
implementation of most public key cryptosystems. A great
amount of work has been devoted to this topic (see [1] or
[2] for a comprehensive list). The two types of finite fields
which are mostly used in cryptographic standards are binary
finite fields of type $\mathbb{F}_{2^m}$ and prime fields of type
$\mathbb{F}_p$, where $p$ is a prime (cf. [3]). Efforts to fit finite
field arithmetic efficiently into commercial processors resulted in
applications of medium characteristic finite fields like those
reported in [4] and [5]. Medium characteristic finite fields are
fields of type $\mathbb{F}_{p^m}$, where $p$ is a prime slightly
smaller than the word size of the processor and has a special form
that simplifies the modular reduction. Mersenne primes are
an example of primes which are used in this context. The
security parameter is given by the length of the binary
representations of the field elements, and the extension degree
$m$ is selected appropriately. Due to security considerations,
the extension degree for fields of characteristic 2 or medium
characteristic is usually chosen to be prime.

With the introduction of the method of Duursma and Lee
for the computation of the Tate pairing (cf. [6]), fields of
type $\mathbb{F}_{3^m}$ for $m$ prime have attracted special attention.
Computing the Tate pairing on elliptic curves defined over
$\mathbb{F}_{3^m}$ requires computations both in $\mathbb{F}_{3^m}$ and
in $\mathbb{F}_{3^{6m}}$. In [7] calculations are implemented using the
tower of extensions

$$\mathbb{F}_{3^m} \subset \mathbb{F}_{3^{2m}} \subset \mathbb{F}_{3^{6m}}$$

and the inherent parallelism of multiplication in extension
fields is used to accelerate the operations. Hardware designs,
and especially FPGA-based ones, are suitable platforms for
parallel implementation of algorithms. In that work
multiplications in the first and the second field extensions are
computed via 3 and 6 multiplications in the respective ground
fields, requiring 18 multiplications in $\mathbb{F}_{3^{97}}$.

In our current work, which is mostly based on [7], we on the
one hand use asymptotically fast methods to improve
the performance of multiplication in $\mathbb{F}_{3^{97}}$, and on the
other hand propose new multiplication formulas to speed up
multiplication in $\mathbb{F}_{3^{6 \cdot 97}}$. Using the new formulas,
multiplication in $\mathbb{F}_{3^{6 \cdot 97}}$ is done with only 15
multiplications instead of 18. We use the same extension tower, with
3 multiplications in $\mathbb{F}_{3^{97}}$ to multiply elements in
$\mathbb{F}_{3^{2 \cdot 97}}$, but only 5 multiplications in
$\mathbb{F}_{3^{2 \cdot 97}}$ for $\mathbb{F}_{3^{6 \cdot 97}}$. Our
proposed method has a slightly increased number of additions in
comparison to the Karatsuba method. Notice however that a
multiplication in $\mathbb{F}_{3^{97}}$ requires many more resources
than an addition, therefore the overall resource consumption will be
reduced. The details of our method to generate the new formulas have
been omitted to limit the complexity and diversity of material
in this paper, and have been submitted as another paper
for CHES 2007.

A considerable amount of work has been done on hardware-based
multiplication in finite fields, especially those of
characteristic 3. The authors of [8] propose a least significant
digit-element (LSDE) multiplier for $\mathbb{F}_{3^m}$. This multiplier
divides the input polynomials into digits of length $D$. Whereas
the digits of one input polynomial are processed in parallel,
the digits of the other input polynomial are handled serially.
Then the result is reduced modulo the irreducible polynomial.
The same structure has also been used in [7] for
multiplication in $\mathbb{F}_{3^{97}}$. Our multiplier, on the other
hand, is based on the digit-serial implementation of the LFSR (Linear
Feedback Shift Register) multiplier, which is widely used in the
literature (see [9] or [10]) and performs the modular reduction
during the multiplication. The first contribution of our
current work is the application of the Karatsuba method
inside the digit-multipliers, which results in a smaller area for
these multipliers. Our results demonstrate the efficiency of
this design compared to other works. The second contribution
is the application of a method using only 5 multiplications
in $\mathbb{F}_{3^{2 \cdot 97}}$ for multiplication in
$\mathbb{F}_{3^{6 \cdot 97}}$. This results in
an area saving of almost 17% compared to the Karatsuba
method which is used in [7].

Our work is organized as follows. Section 2 is devoted
to the general structure of our multiplier for $\mathbb{F}_{3^{97}}$.
In Section 3 we describe some improvements on the traditional
LFSR multiplier and compare our results with other works
from the literature. In Section 4 the new formulas for
$\mathbb{F}_{3^{6 \cdot 97}}$ together with suggestions for a new
multiplier are presented, and Section 5 concludes the paper.



2. MULTIPLICATION IN $\mathbb{F}_{3^{97}}$

The finite field $\mathbb{F}_{3^{97}}$ can be represented as a vector
space over $\mathbb{F}_3$. In this representation, elements of
$\mathbb{F}_{3^{97}}$ are vectors of length 97 over $\mathbb{F}_3$.
Addition of elements is computed by adding the corresponding vectors.
Multiplication is more complicated, and depends on the selected basis
for $\mathbb{F}_{3^{97}}$. Two bases are popular in the literature,
namely polynomial and normal bases. A polynomial basis is
generally more suitable for multiplication, hence we choose
this basis in our work.

In the polynomial basis, elements of $\mathbb{F}_{3^{97}}$ are
represented as polynomials of degree at most 96 over $\mathbb{F}_3$.
Two elements are added by adding the corresponding polynomials.
Multiplication is based on polynomial multiplication
followed by reduction modulo the irreducible polynomial
which generates the polynomial basis. In our case the
irreducible polynomial, which we denote by $f(x)$, is

$$x^{97} + x^{16} + 2. \tag{1}$$

In the next sections we show the details of polynomial
arithmetic in our designs.

2.1. Arithmetic in $\mathbb{F}_3$

The element $a \in \mathbb{F}_3$ is represented using the vector
$(a_1, a_0)$ of two bits such that the elements 0, 1, and 2 are (0, 0),
(0, 1), and (1, 0), respectively. In this representation the
operations addition, multiplication, and negation (multiplication
by 2) are done, as shown in [11], using Equations (2), (3), and
(4), respectively.

$$(a_1, a_0) + (b_1, b_0) = ((a_0 \lor b_0) \oplus t,\ (a_1 \lor b_1) \oplus t), \quad \text{where } t = (a_0 \lor b_1) \oplus (a_1 \lor b_0), \tag{2}$$

$$(a_1, a_0) \cdot (b_1, b_0) = ((a_1 \land b_0) \lor (a_0 \land b_1),\ (a_0 \land b_0) \lor (a_1 \land b_1)), \tag{3}$$

$$-(a_1, a_0) = (a_0, a_1). \tag{4}$$

The implementation of Equations (2) and (3) is done using 2
LUTs in the FPGA, whereas (4) is only a permutation of
bits.

Fig. 1. Structure of a digit-level LFSR multiplier
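The bit-level rules (2), (3), and (4) can be checked exhaustively in software. The following is a minimal sketch, assuming the encoding 0 = (0, 0), 1 = (0, 1), 2 = (1, 0) stated above; the function names `f3_add`, `f3_mul`, `f3_neg` and the tables `ENCODE`/`DECODE` are illustrative and do not come from the paper.

```python
# Bit-pair arithmetic in F_3; every element is a tuple (a1, a0) of two bits.
ENCODE = {0: (0, 0), 1: (0, 1), 2: (1, 0)}
DECODE = {bits: value for value, bits in ENCODE.items()}


def f3_add(a, b):
    """Addition following Equation (2)."""
    a1, a0 = a
    b1, b0 = b
    t = (a0 | b1) ^ (a1 | b0)
    return ((a0 | b0) ^ t, (a1 | b1) ^ t)


def f3_mul(a, b):
    """Multiplication following Equation (3)."""
    a1, a0 = a
    b1, b0 = b
    return ((a1 & b0) | (a0 & b1), (a0 & b0) | (a1 & b1))


def f3_neg(a):
    """Negation following Equation (4): swap the two bits."""
    a1, a0 = a
    return (a0, a1)


def check_against_integers():
    """Compare all nine operand pairs with integer arithmetic modulo 3."""
    for x in range(3):
        assert DECODE[f3_neg(ENCODE[x])] == (-x) % 3
        for y in range(3):
            assert DECODE[f3_add(ENCODE[x], ENCODE[y])] == (x + y) % 3
            assert DECODE[f3_mul(ENCODE[x], ENCODE[y])] == (x * y) % 3
    return True
```

Calling `check_against_integers()` compares the three rules with ordinary arithmetic modulo 3 on every input; the unused bit pattern (1, 1) is never produced for valid inputs.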



2.2. Structure of the multiplier for $\mathbb{F}_{3^{97}}$

The structure of a digit-level LFSR multiplier is shown in
Figure 1. In this figure the two input polynomials $a(x)$
and $b(x)$ are loaded into registers A and B, respectively,
and divided into digits of length $D$. In each clock cycle the
most significant digit of B is multiplied by the words of A,
through the digit-multipliers denoted by M, and added to the
content of the register in the feedback circuit. The inputs to the
digit-multipliers are two polynomials of degree $D - 1$ in $x$.
The product is a polynomial of degree $2(D - 1)$. Powers
$x^D$ to $x^{2(D-1)}$ of each multiplier must be added to the
powers $x^0$ to $x^{D-2}$ of the next multiplier. This is done by the
overlap circuit. In each clock cycle the register B and the LFSR
are shifted by $D$ positions to the right. Shifting the LFSR to the
right is equivalent to multiplication by $x^D$, which generates the
powers $x^{97}$ to $x^{96+D}$. These powers are reduced modulo $f(x)$
of (1) using the feedback circuit. The name Linear Feedback
Shift Register derives from these feedback structures. For
more information about the digit-level LFSR multiplier and
its costs for classical methods see [10]. In the next section
we discuss our improvements to the traditional LFSR multiplier.
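The data flow described above (most significant digit first, shift by $x^D$, reduction by $f(x)$ in every step) can be modeled in software. The sketch below is a behavioral model only, under the assumption that coefficients are stored as integers 0, 1, 2 with the lowest degree first; it does not model the registers, the overlap circuit, or the timing, and all function names are illustrative.

```python
M, K = 97, 16  # f(x) = x^97 + x^16 + 2 over F_3, see Equation (1)


def poly_mul(a, b):
    """Schoolbook product of two coefficient lists over F_3."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] = (c[i + j] + ai * bj) % 3
    return c


def reduce_mod_f(c):
    """Reduce modulo f(x) using x^97 = -x^16 - 2 = 2*x^16 + 1 in F_3."""
    c = [v % 3 for v in c]
    for i in range(len(c) - 1, M - 1, -1):
        v = c[i]
        if v:
            c[i] = 0
            c[i - M + K] = (c[i - M + K] + 2 * v) % 3
            c[i - M] = (c[i - M] + v) % 3
    c = c[:M]
    return c + [0] * (M - len(c))


def digit_serial_mul(a, b, d):
    """Multiply a*b mod f(x), consuming one digit of b (d coefficients) per step."""
    steps = -(-M // d)                      # ceil(97 / d) clock cycles
    b = b + [0] * (steps * d - M)           # pad b with leading zero coefficients
    acc = [0] * M
    for k in range(steps - 1, -1, -1):      # most significant digit first
        digit = b[k * d:(k + 1) * d]
        total = [0] * d + acc               # multiply the accumulator by x^d
        for i, v in enumerate(poly_mul(a, digit)):
            total[i] = (total[i] + v) % 3   # add the partial product a * digit
        acc = reduce_mod_f(total)           # reduce the powers x^97 .. x^(96+d)
    return acc
```

For any digit size the result should agree with a full polynomial product followed by a single reduction, which is how the model can be cross-checked.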



3. THE KARATSUBA METHOD 

In this section we use asymptotically fast methods to reduce
the size of the digit-multipliers. We use a similar approach to
[12] and combine the classical and the Karatsuba methods to
build small digit-multipliers. Two linear polynomials
$a_1 x + a_0$ and $b_1 x + b_0$ are multiplied classically using the
formula

$$a_1 b_1 x^2 + (a_1 b_0 + a_0 b_1) x + a_0 b_0 \tag{5}$$

with 4 multiplications and 1 addition. The same product can
also be computed via

$$a_1 b_1 x^2 + ((a_1 + a_0)(b_1 + b_0) - a_1 b_1 - a_0 b_0) x + a_0 b_0. \tag{6}$$

The new formula is called the Karatsuba method (see [13]).
It requires 7 operations instead of 5, but only 3 multiplications,
and uses fewer resources when the coefficients $a_0$, $a_1$,
$b_0$, $b_1$ are replaced by polynomials. The classical method for
multiplication of two polynomials of degree $n - 1$ requires
$O(n^2)$ operations. Recursive application of the Karatsuba
method reduces the cost of a multiplication to $O(n^{1.59})$
operations. We denote the classical multiplication of two
polynomials of degree $n - 1$ by $\mathcal{C}_n$ and the method of (6)
by $\mathcal{K}$. The methods $\mathcal{C}_n$ for $n \in \mathbb{N}$,
and $\mathcal{K}$, constitute a set of polynomial multiplication
methods. We call this set $\mathcal{T}$.
Using the elements of $\mathcal{T}$ we define the set of recursive
multiplication methods $\mathcal{T}^*$, which contains the elements of
$\mathcal{T}$ and all recursive combinations of elements of
$\mathcal{T}^*$. The recursive combination of the two methods
$\mathcal{M}$ and $\mathcal{N}$, for polynomials of lengths $m$ and $n$,
respectively, is the multiplication method $\mathcal{MN}$ for
polynomials of length $mn$. Let

$$a(x) = a_{mn-1} x^{mn-1} + \cdots + a_0, \quad \text{and}$$
$$b(x) = b_{mn-1} x^{mn-1} + \cdots + b_0$$

be given polynomials. In order to apply $\mathcal{MN}$, we write
these polynomials as

$$a(x) = A_{m-1} X^{m-1} + \cdots + A_0, \quad \text{and}$$
$$b(x) = B_{m-1} X^{m-1} + \cdots + B_0,$$

where $X = x^n$ and $A_0, \ldots, A_{m-1}, B_0, \ldots, B_{m-1}$ are
polynomials of degree at most $n - 1$. If the polynomials $A_i$ and
$B_i$ were coefficients, the two polynomials $a(x)$ and $b(x)$ would be
multiplied using $\mathcal{M}$. The product using the method
$\mathcal{MN}$ consists of several multiplications of the polynomials
$A_i$ and $B_i$, which are performed using $\mathcal{N}$. We implement
the digit-multipliers using the elements of $\mathcal{T}^*$ to reduce
their size. Our approach is similar to [12].
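The recursive combination of methods can be illustrated with a small software model. The sketch below assumes coefficients in $\mathbb{F}_3$ stored as integers with the lowest degree first, and it counts coefficient multiplications only; the names `classical` and `karatsuba` are illustrative, and the counter argument is a device of this sketch, not part of the paper's design.

```python
P = 3  # characteristic of the coefficient field


def classical(a, b, counter):
    """Schoolbook product of two length-n lists; counter[0] counts multiplications."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            counter[0] += 1
            c[i + j] = (c[i + j] + ai * bj) % P
    return c


def karatsuba(inner):
    """Combine the Karatsuba step (6) with `inner`, used for the half-length products."""
    def mul(a, b, counter):
        h = len(a) // 2                      # inputs must have equal, even length
        a0, a1, b0, b1 = a[:h], a[h:], b[:h], b[h:]
        lo = inner(a0, b0, counter)          # A0 * B0
        hi = inner(a1, b1, counter)          # A1 * B1
        sa = [(x + y) % P for x, y in zip(a0, a1)]
        sb = [(x + y) % P for x, y in zip(b0, b1)]
        mid = inner(sa, sb, counter)         # (A0 + A1) * (B0 + B1)
        mid = [(m - l - u) % P for m, l, u in zip(mid, lo, hi)]
        c = [0] * (2 * len(a) - 1)
        for i, v in enumerate(lo):
            c[i] = (c[i] + v) % P
        for i, v in enumerate(mid):
            c[i + h] = (c[i + h] + v) % P
        for i, v in enumerate(hi):
            c[i + 2 * h] = (c[i + 2 * h] + v) % P
        return c
    return mul


# Karatsuba over classical length-4 blocks (length 8), and two Karatsuba levels (length 16).
kc4 = karatsuba(classical)
kkc4 = karatsuba(karatsuba(classical))
```

In this model, `kc4` on length-8 inputs performs 3 * 16 = 48 coefficient multiplications instead of 64, and `kkc4` on length-16 inputs performs 9 * 16 = 144 instead of 256, at the price of additional additions and subtractions.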

In Table 1 we show the results of implementing $\mathbb{F}_{3^{97}}$
multipliers on an XC2VP20-6FF896 FPGA. In this table the
first column is the digit-size $D$. In a digit-level multiplier
with digit-size $D$, inputs are preceded by enough zeros so
that their length becomes a multiple of $D$. Hence it is natural
to choose a value of $D$ such that the difference
$\lceil m/D \rceil - m/D$ is as small as possible. Our values for $D$
are selected using this criterion and hence differ from other standard
values like multiples of 4 in other works (see [8] and [7]). The
string in the second column shows the recursive combination
of the Karatsuba and classical methods which is applied.
It is important to notice that the method $\mathcal{KC}_4$, which
we used for polynomials of degree 6, applies to polynomials
of length 8. Therefore, we add a zero in front of the
polynomial and then remove all the gates containing an
operation with the coefficients which are known to be zero.
Hence this multiplier requires fewer resources than a
complete $\mathcal{KC}_4$. This point distinguishes our approach from
that in [12]. The third, fourth, and fifth columns give the number
of slices, the maximum working frequency of the multiplier,
and the required number of clock cycles for our designs.

Table 1. Timing and area costs of digit-level LFSR multipliers
in $\mathbb{F}_{3^{97}}$ for different values of digit-size $D$

D    Multiplication       # of slices   Maximum frequency   # of clock cycles
     method                             (MHz)               = ⌈97/D⌉
1    -                    327           300                 97
2    $\mathcal{C}_2$      800           174                 49
4    $\mathcal{C}_4$      1716          125                 25
7    $\mathcal{KC}_4$     2954          111                 14
14   $\mathcal{KKC}_4$    4006          72                  7
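The selection criterion for the digit-size can be made concrete with a few lines of arithmetic. The sketch below assumes $m = 97$ and reports, for a given $D$, the number of clock cycles $\lceil 97/D \rceil$ and the number of zero coefficients that must be prepended; the function name `padding_waste` is illustrative.

```python
M = 97  # extension degree of the field


def padding_waste(d, m=M):
    """Return (clock cycles, zero coefficients prepended) for digit size d."""
    cycles = -(-m // d)          # integer ceiling of m / d
    return cycles, cycles * d - m


# Digit sizes used in Table 1 compared with a power-of-two choice such as 8.
for d in (1, 2, 4, 7, 8, 14):
    cycles, zeros = padding_waste(d)
    print(f"D={d:2d}: {cycles:2d} cycles, {zeros} padded zero coefficient(s)")
```

Under these assumptions $D = 7$ and $D = 14$ each waste a single coefficient position, whereas $D = 8$ would prepend 7 zeros while saving only one clock cycle relative to $D = 7$.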

A comparison of our results with those in [7] is
shown in Figure 2. Different digit-sizes result in different
circuits, which we compare with respect to both time and
area. Area is the number of slices, whereas time is the product
of the number of clock cycles and the minimum clock period. Both
designs are on the same technology, but the speed grade of the FPGA in
[7] is not available. As shown, our designs have better
area-time performance. These improvements result, on the
one hand, from using asymptotically faster methods, and on
the other hand, from integrating the modular reduction stage
into the LFSR. When a small digit-serial multiplier is used,
even the small size of a modular reduction must be taken
into account.



Fig. 2. Time vs. area comparisons of our multipliers with those in [7]

4. MULTIPLICATION IN $\mathbb{F}_{3^{6 \cdot 97}}$

Multiplication in $\mathbb{F}_{3^{6 \cdot 97}}$ is done in the same way
as in [7], using a tower of extensions of degrees 2 and 3, i.e.

$$\mathbb{F}_{3^{97}} = \mathbb{F}_3[x]/(x^{97} + x^{16} + 2),$$
$$\mathbb{F}_{3^{2 \cdot 97}} = \mathbb{F}_{3^{97}}[y]/(y^2 + 1),$$
$$\mathbb{F}_{3^{6 \cdot 97}} = \mathbb{F}_{3^{2 \cdot 97}}[z]/(z^3 - z - 1).$$

The elements of $\mathbb{F}_{3^{2 \cdot 97}}$ are polynomials of degree
at most 1 in $s$ over $\mathbb{F}_{3^{97}}$, where $s$ is a root of
$y^2 + 1$ in $\mathbb{F}_{3^{2 \cdot 97}}$. These polynomials are
multiplied by applying (6) and then reduced modulo $s^2 + 1$.
The elements of $\mathbb{F}_{3^{6 \cdot 97}}$ are polynomials of degree
at most 2 in $r$ over $\mathbb{F}_{3^{2 \cdot 97}}$, where $r$ is a root
of $z^3 - z - 1$ in $\mathbb{F}_{3^{6 \cdot 97}}$. They are multiplied
using the following formulas and then reduced modulo $r^3 - r - 1$.

$$(a_0 + a_1 r + a_2 r^2)(b_0 + b_1 r + b_2 r^2) = c_0 + c_1 r + c_2 r^2 + c_3 r^3 + c_4 r^4,$$

where

$$\begin{aligned}
P_0 &= (a_0 + a_1 + a_2)(b_0 + b_1 + b_2), \\
P_1 &= (a_0 + s a_1 - a_2)(b_0 + s b_1 - b_2), \\
P_2 &= (a_0 - a_1 + a_2)(b_0 - b_1 + b_2), \\
P_3 &= (a_0 - s a_1 - a_2)(b_0 - s b_1 - b_2), \\
P_4 &= a_2 b_2,
\end{aligned}$$

and

$$\begin{aligned}
c_0 &= P_0 + P_1 + P_2 + P_3 - P_4, \\
c_1 &= P_0 - s P_1 - P_2 + s P_3, \\
c_2 &= P_0 - P_1 + P_2 - P_3, \\
c_3 &= P_0 + s P_1 - P_2 - s P_3, \\
c_4 &= P_4.
\end{aligned} \tag{7}$$
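The five-multiplication formulas (7) can be checked against a schoolbook product. The sketch below is a sanity check under a simplifying assumption: the small field $\mathbb{F}_9 = \mathbb{F}_3[s]/(s^2 + 1)$ stands in for $\mathbb{F}_{3^{2 \cdot 97}}$, since (7) only relies on characteristic 3 and on $s^2 = -1$. Elements $x + ys$ are stored as tuples `(x, y)`, and all function names are illustrative.

```python
# Arithmetic in F_9 = F_3[s]/(s^2 + 1); the element x + y*s is the tuple (x, y).
def add(u, v):
    return ((u[0] + v[0]) % 3, (u[1] + v[1]) % 3)


def sub(u, v):
    return ((u[0] - v[0]) % 3, (u[1] - v[1]) % 3)


def mul(u, v):
    return ((u[0] * v[0] - u[1] * v[1]) % 3, (u[0] * v[1] + u[1] * v[0]) % 3)


def times_s(u):
    """s * (x + y*s) = -y + x*s: a swap of components and one negation."""
    return ((-u[1]) % 3, u[0])


def five_mult(a, b):
    """Unreduced product c0..c4 of two quadratic polynomials in r, following (7)."""
    a0, a1, a2 = a
    b0, b1, b2 = b
    sa1, sb1 = times_s(a1), times_s(b1)
    p0 = mul(add(add(a0, a1), a2), add(add(b0, b1), b2))
    p1 = mul(sub(add(a0, sa1), a2), sub(add(b0, sb1), b2))
    p2 = mul(add(sub(a0, a1), a2), add(sub(b0, b1), b2))
    p3 = mul(sub(sub(a0, sa1), a2), sub(sub(b0, sb1), b2))
    p4 = mul(a2, b2)
    sp1, sp3 = times_s(p1), times_s(p3)
    c0 = sub(add(add(add(p0, p1), p2), p3), p4)
    c1 = add(sub(sub(p0, sp1), p2), sp3)
    c2 = sub(add(sub(p0, p1), p2), p3)
    c3 = sub(sub(add(p0, sp1), p2), sp3)
    return [c0, c1, c2, c3, p4]


def schoolbook(a, b):
    """Reference: nine multiplications, no tricks."""
    c = [(0, 0)] * 5
    for i in range(3):
        for j in range(3):
            c[i + j] = add(c[i + j], mul(a[i], b[j]))
    return c
```

The formulas amount to evaluating both factors at $1$, $s$, $-1$, $-s$ and at infinity and interpolating; the interpolation needs a division by 4, which is free in characteristic 3 because $4 \equiv 1$.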



Combining (6) and (7), we have the following theorem.

Theorem 1 Let $\alpha, \beta \in \mathbb{F}_{3^{6 \cdot 97}}$ be given as:

$$\alpha = a_0 + a_1 s + a_2 r + a_3 rs + a_4 r^2 + a_5 r^2 s,$$
$$\beta = b_0 + b_1 s + b_2 r + b_3 rs + b_4 r^2 + b_5 r^2 s.$$

Let further their product $\gamma = \alpha\beta \in \mathbb{F}_{3^{6 \cdot 97}}$ be

$$\gamma = c_0 + c_1 s + c_2 r + c_3 rs + c_4 r^2 + c_5 r^2 s.$$

Then the coefficients $c_0, \ldots, c_5$ of the product can be
computed using only 15 multiplications in $\mathbb{F}_{3^{97}}$.

Closed-form formulas for this multiplication are shown
in Appendix A. Scalar multiplications are particularly simple
using these formulas: they are multiplications by $-1$, $s$, and $-s$.
Negation of coefficients, and consequently of polynomials, is only a
permutation of bits, as seen in Section 2. Indeed, multiplication of
an element in $\mathbb{F}_{3^{2 \cdot 97}}$ by $s$ is a permutation,
too. Let $\alpha = a_1 s + a_0 \in \mathbb{F}_{3^{2 \cdot 97}}$, then

$$s\alpha = a_1 s^2 + a_0 s \equiv a_0 s - a_1 \pmod{s^2 + 1}.$$

All of the $\mathbb{F}_{3^{97}}$-multiplications can be done in
parallel. This property allows designers to implement as many of
these multipliers as possible, according to their time-area
constraints. On the other hand, these multipliers are used
for other computations such as point addition and doubling
on elliptic curves for pairing-based cryptography. Reading
and writing intermediate values into register files in such
applications is time-consuming. To solve this problem we propose
a new multiplier, which is shown in Figure 3. The new
multiplier consists of three pipeline stages, namely input,
multiplication, and output. During each multiplication
in $\mathbb{F}_{3^{97}}$, the input stage loads the coefficients $a_i$
and $b_i$ from memory for the next multiplication, and computes the
linear combinations in (8) needed for the products $P_i$. At the same
time the output stage adds the last computed product $P_i$ to memory
variables according to (8). In this structure the hatched
multiplexers can select either one of their inputs or the sum of
the inputs. In this way all possible multiples of input
polynomials can be selected and added to the accumulators.

Fig. 3. The proposed structure block for implementing the
formulas of Appendix A



5. CONCLUSION 

In this paper we proposed a new structure for multiplication
in $\mathbb{F}_{3^{97}}$. This structure is based on digit-level LFSR
multipliers, where the area of the digit-multipliers is reduced using
the Karatsuba method. Another advantage of this approach is that the
modular reduction is performed during the multiplication.
Our synthesis results showed a performance improvement
compared to other designs in the literature. We have also
presented new formulas for multiplication in
$\mathbb{F}_{3^{6 \cdot 97}}$ using only 15 multiplications in
$\mathbb{F}_{3^{97}}$, whereas 18 multiplications are required when the
Karatsuba method is applied. Furthermore, we
have introduced a feasible hardware structure for realizing
our proposed formulas. Our formulas are for the case that
$\mathbb{F}_{3^{6 \cdot 97}}$ is constructed from
$\mathbb{F}_{3^{2 \cdot 97}}$ using the irreducible polynomial
$z^3 - z - 1$. If the finite field is constructed
using $z^3 - z + 1$, the formulas require slight modifications.
A. MULTIPLICATION FORMULAS FOR $\mathbb{F}_{3^{6 \cdot 97}}$

Let $\alpha, \beta \in \mathbb{F}_{3^{6 \cdot 97}}$ be given as:

$$\alpha = a_0 + a_1 s + a_2 r + a_3 rs + a_4 r^2 + a_5 r^2 s,$$
$$\beta = b_0 + b_1 s + b_2 r + b_3 rs + b_4 r^2 + b_5 r^2 s,$$

where $a_0, \ldots, b_5 \in \mathbb{F}_{3^{97}}$ and
$s \in \mathbb{F}_{3^{2 \cdot 97}}$, $r \in \mathbb{F}_{3^{6 \cdot 97}}$
are roots of $y^2 + 1$ and $z^3 - z - 1$, respectively. Let their
product $\gamma = \alpha\beta \in \mathbb{F}_{3^{6 \cdot 97}}$ be

$$\gamma = c_0 + c_1 s + c_2 r + c_3 rs + c_4 r^2 + c_5 r^2 s.$$

Then the coefficients $c_0, \ldots, c_5 \in \mathbb{F}_{3^{97}}$ of the
product can be computed using the following formulas.

$$\begin{aligned}
P_0 &= (a_0 + a_2 + a_4)(b_0 + b_2 + b_4), \\
P_1 &= (a_0 + a_1 + a_2 + a_3 + a_4 + a_5)(b_0 + b_1 + b_2 + b_3 + b_4 + b_5), \\
P_2 &= (a_1 + a_3 + a_5)(b_1 + b_3 + b_5), \\
P_3 &= (a_0 + s a_2 - a_4)(b_0 + s b_2 - b_4), \\
P_4 &= (a_0 + a_1 + s a_2 + s a_3 - a_4 - a_5)(b_0 + b_1 + s b_2 + s b_3 - b_4 - b_5), \\
P_5 &= (a_1 + s a_3 - a_5)(b_1 + s b_3 - b_5), \\
P_6 &= (a_0 - a_2 + a_4)(b_0 - b_2 + b_4), \\
P_7 &= (a_0 + a_1 - a_2 - a_3 + a_4 + a_5)(b_0 + b_1 - b_2 - b_3 + b_4 + b_5), \\
P_8 &= (a_1 - a_3 + a_5)(b_1 - b_3 + b_5), \\
P_9 &= (a_0 - s a_2 - a_4)(b_0 - s b_2 - b_4), \\
P_{10} &= (a_0 + a_1 - s a_2 - s a_3 - a_4 - a_5)(b_0 + b_1 - s b_2 - s b_3 - b_4 - b_5), \\
P_{11} &= (a_1 - s a_3 - a_5)(b_1 - s b_3 - b_5), \\
P_{12} &= a_4 b_4, \\
P_{13} &= (a_4 + a_5)(b_4 + b_5), \\
P_{14} &= a_5 b_5,
\end{aligned}$$

$$\begin{aligned}
c_0 &= -P_0 + P_2 + (s + 1) P_3 - (s + 1) P_5 - (s - 1) P_9 + (s - 1) P_{11} - P_{12} + P_{14}, \\
c_1 &= P_0 - P_1 + P_2 - (s + 1) P_3 + (s + 1) P_4 - (s + 1) P_5 + (s - 1) P_9 - (s - 1) P_{10} \\
    &\quad + (s - 1) P_{11} + P_{12} - P_{13} + P_{14}, \\
c_2 &= -P_0 + P_2 + P_6 - P_8 + P_{12} - P_{14}, \\
c_3 &= P_0 - P_1 + P_2 - P_6 + P_7 - P_8 - P_{12} + P_{13} - P_{14}, \\
c_4 &= P_0 - P_2 - P_3 + P_5 + P_6 - P_8 - P_9 + P_{11} + P_{12} - P_{14}, \\
c_5 &= -P_0 + P_1 - P_2 + P_3 - P_4 + P_5 - P_6 + P_7 - P_8 + P_9 - P_{10} + P_{11} \\
    &\quad - P_{12} + P_{13} - P_{14}.
\end{aligned} \tag{8}$$
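The formulas of this appendix can be sanity-checked in software against a direct product in the tower. The sketch below rests on a loudly stated assumption: the prime field $\mathbb{F}_3$ stands in for $\mathbb{F}_{3^{97}}$, so that $\mathbb{F}_9 = \mathbb{F}_3[s]/(s^2 + 1)$ plays the role of $\mathbb{F}_{3^{2 \cdot 97}}$. The class `F9` and the functions `mul15` and `reference` are illustrative names, the signs follow what (6), (7), and the reduction $r^3 = r + 1$ imply, and a check over a small field is a consistency test rather than a proof.

```python
class F9:
    """Element x + y*s of F_9 = F_3[s]/(s^2 + 1)."""
    def __init__(self, x, y=0):
        self.x, self.y = x % 3, y % 3
    def __add__(self, o):
        return F9(self.x + o.x, self.y + o.y)
    def __sub__(self, o):
        return F9(self.x - o.x, self.y - o.y)
    def __neg__(self):
        return F9(-self.x, -self.y)
    def __mul__(self, o):
        return F9(self.x * o.x - self.y * o.y, self.x * o.y + self.y * o.x)
    def __eq__(self, o):
        return (self.x, self.y) == (o.x, o.y)

s, one = F9(0, 1), F9(1)

def mul15(a, b):
    """Coefficients c0..c5 of the product, from the products P0..P14."""
    a0, a1, a2, a3, a4, a5 = a
    b0, b1, b2, b3, b4, b5 = b
    P = [
        (a0 + a2 + a4) * (b0 + b2 + b4),
        (a0 + a1 + a2 + a3 + a4 + a5) * (b0 + b1 + b2 + b3 + b4 + b5),
        (a1 + a3 + a5) * (b1 + b3 + b5),
        (a0 + s*a2 - a4) * (b0 + s*b2 - b4),
        (a0 + a1 + s*a2 + s*a3 - a4 - a5) * (b0 + b1 + s*b2 + s*b3 - b4 - b5),
        (a1 + s*a3 - a5) * (b1 + s*b3 - b5),
        (a0 - a2 + a4) * (b0 - b2 + b4),
        (a0 + a1 - a2 - a3 + a4 + a5) * (b0 + b1 - b2 - b3 + b4 + b5),
        (a1 - a3 + a5) * (b1 - b3 + b5),
        (a0 - s*a2 - a4) * (b0 - s*b2 - b4),
        (a0 + a1 - s*a2 - s*a3 - a4 - a5) * (b0 + b1 - s*b2 - s*b3 - b4 - b5),
        (a1 - s*a3 - a5) * (b1 - s*b3 - b5),
        a4 * b4,
        (a4 + a5) * (b4 + b5),
        a5 * b5,
    ]
    sp, sm = s + one, s - one
    c0 = -P[0] + P[2] + sp*P[3] - sp*P[5] - sm*P[9] + sm*P[11] - P[12] + P[14]
    c1 = (P[0] - P[1] + P[2] - sp*P[3] + sp*P[4] - sp*P[5]
          + sm*P[9] - sm*P[10] + sm*P[11] + P[12] - P[13] + P[14])
    c2 = -P[0] + P[2] + P[6] - P[8] + P[12] - P[14]
    c3 = P[0] - P[1] + P[2] - P[6] + P[7] - P[8] - P[12] + P[13] - P[14]
    c4 = P[0] - P[2] - P[3] + P[5] + P[6] - P[8] - P[9] + P[11] + P[12] - P[14]
    c5 = (-P[0] + P[1] - P[2] + P[3] - P[4] + P[5] - P[6] + P[7]
          - P[8] + P[9] - P[10] + P[11] - P[12] + P[13] - P[14])
    return [c0, c1, c2, c3, c4, c5]

def reference(a, b):
    """Direct product: schoolbook in r over F_9, then r^3 = r + 1, r^4 = r^2 + r."""
    A = [a[0] + s*a[1], a[2] + s*a[3], a[4] + s*a[5]]
    B = [b[0] + s*b[1], b[2] + s*b[3], b[4] + s*b[5]]
    C = [F9(0)] * 5
    for i in range(3):
        for j in range(3):
            C[i + j] = C[i + j] + A[i] * B[j]
    g = [C[0] + C[3], C[1] + C[3] + C[4], C[2] + C[4]]
    return [F9(v) for pair in g for v in (pair.x, pair.y)]
```

Because `F9.__eq__` compares both components, a comparison of `mul15(a, b)` with `reference(a, b)` also confirms that every $c_i$ has a vanishing $s$-component, i.e. that it lies in the ground field of this model.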



